System for using statistical classifiers for spoken language understanding

ABSTRACT

One embodiment of the present invention involves using one or more statistical classifiers to perform task classification on natural language inputs. In another embodiment, the statistical classifiers are used in conjunction with a rule-based classifier to perform task classification.

BACKGROUND OF THE INVENTION

[0001] The present invention deals with spoken language understanding. More specifically, the present invention deals with the use of statistical classification for spoken language understanding.

[0002] Natural language understanding is the process of receiving at a computer an input expressed as a natural language input. The computer then attempts to understand the meaning of the natural language input and take any desired action based on the natural language input.

[0003] Two types of natural language inputs which interfaces have attempted to accommodate in the past include type-in lines and speech inputs. Type-in lines simply include a field into which the user can type a natural language expression. Speech inputs include a speech recognition engine which receives a speech signal input by the user and generates a textual representation of the speech signal.

SUMMARY OF THE INVENTION

[0004] Natural user interfaces which can accept natural language inputs must often gain two levels of understanding of the input in order to complete an action (or task) based on the input. First, the system must classify the user input to one of a number of different classes or tasks. This involves first generating a list of tasks which the user can request and then classifying the user input to one of those different tasks.

[0005] Next, the system must identify semantic items in the natural language input. The semantic items correspond to the specifics of a desired task.

[0006] By way of example, assume the user typed in the statement “Send an email to John Doe.” Task classification would involve identifying the task associated with this input as a “SendMail” task, and the semantic analysis would involve identifying the term “John Doe” as the “recipient” of the electronic mail message to be generated.

[0007] Statistical classifiers are generally considered to be robust and can be easily trained. Also, such classifiers require little supervision during training, but they often suffer from poor generalization when data is insufficient. Grammar-based robust parsers are expressive and portable, and can model the language at varying levels of granularity. These parsers are easy to modify by hand in order to adapt to new language usages. While robust parsers yield an accurate and detailed analysis when a spoken utterance is covered by the grammar, they are less robust for sentences not covered by the training data, even with robust understanding techniques.

[0008] One embodiment of the present invention involves using one or more statistical classifiers in order to perform task classification on natural language inputs. In another embodiment, the statistical classifiers can be used in conjunction with a rule-based classifier to perform task classification.

[0009] While an improvement in task classification itself is helpful and addresses the first level of understanding that a natural language interface must demonstrate, task classification alone may not provide the detailed understanding of the semantics required to complete some tasks based on a natural language input. Therefore, another embodiment of the present invention includes a semantic analysis component as well. This embodiment of the invention uses a rule-based understanding system to obtain a deep understanding of the natural language input. Thus, the invention can include a two-pass approach in which classifiers are used to classify the natural language input into one or more tasks and then rule-based parsers are used to fill semantic slots in the identified tasks.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 is a block diagram of one illustrative environment in which the present invention can be used.

[0011] FIG. 2 is a block diagram of a portion of a natural language interface in accordance with one embodiment of the present invention.

[0012] FIG. 3 illustrates another embodiment in which multiple statistical classifiers are used.

[0013] FIG. 4 illustrates another embodiment in which multiple, cascaded statistical classifiers are used.

[0014] FIG. 5 is a block diagram illustrating another embodiment in which not only are one or more statistical classifiers used for task classification, but a rule-based analyzer is also used for task classification.

[0015] FIG. 6 is a block diagram of a portion of a natural language interface in which task classification and more detailed semantic understanding are obtained in accordance with one embodiment of the present invention.

[0016] FIG. 7 is a flow diagram illustrating the operation of the system shown in FIG. 6.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

Overview

[0017] Two different aspects of the present invention involve performing task classification on a natural language input and performing semantic analysis on a natural language input in conjunction with task classification in order to obtain a natural user interface. However, prior to discussing the invention in more detail, one embodiment of an exemplary environment in which the present invention can be implemented will be discussed.

[0018] FIG. 1 illustrates an example of a suitable computing system environment in which the invention may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment.

[0019] The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

[0020] The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

[0021] With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

[0022] Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

[0023] The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during startup, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

[0024] The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

[0025] The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.

[0026] A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.

[0027] The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, Intranets and the Internet.

[0028] When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user-input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

[0029] It should be noted that the present invention can be carried out on a computer system such as that described with respect to FIG. 1. However, the present invention can be carried out on a server, a computer devoted to message handling, or on a distributed system in which different portions of the present invention are carried out on different parts of the distributed computing system.

Overview of Task Classification System

[0030] FIG. 2 is a block diagram of a portion of a natural language interface 200. System 200 includes a feature selection component 202 and a statistical classifier 204. System 200 can also include optional speech recognition engine 206 and optional preprocessor 211. Where interface 200 is to accept speech signals as an input, it includes speech recognizer 206. However, where interface 200 is simply to receive textual input, speech recognizer 206 is not needed. Also, preprocessing (as discussed below) is optional. The present discussion will proceed with respect to an embodiment in which speech recognizer 206 and preprocessor 211 are present, although it will be appreciated that they need not be present in other embodiments. Also, other natural language communication modes can be used, such as handwriting or other modes. In such cases, suitable recognition components, such as handwriting recognition components, are used.

[0031] In order to perform task classification, system 200 first receives an utterance 208 in the form of a speech signal that represents natural language speech spoken by a user. Speech recognizer 206 performs speech recognition on utterance 208 and provides, at its output, natural language text 210. Text 210 is a textual representation of the natural language utterance 208 received by speech recognizer 206. Speech recognizer 206 can be any known speech recognition system which performs speech recognition on a speech input. Speech recognizer 206 may include an application-specific dictation language model, but the particular way in which speech recognizer 206 recognizes speech does not form any part of the invention. Similarly, in another embodiment, speech recognizer 206 outputs a list of results or interpretations with respective probabilities. Later components operate on each interpretation and use the associated probabilities in task classification.

[0032] Natural language text 210 can optionally be provided to preprocessor 211 for preprocessing and then to feature selection component 202. Preprocessing is discussed below with respect to feature selection. Feature selection component 202 identifies features in natural language text 210 (or in each text 210 in the list of results output by the speech recognizer) and outputs feature vector 212 based upon the features identified in text 210. Feature selection component 202 is discussed in greater detail below. Briefly, feature selection component 202 identifies features in text 210 that can be used by statistical classifier 204.

[0033] Statistical classifier 204 receives feature vector 212 and classifies the feature vector into one or more of a plurality of predefined classes or tasks. Statistical classifier 204 outputs a task or class identifier 214 identifying the particular task or class to which statistical classifier 204 has assigned feature vector 212. This, of course, also corresponds to the particular class or task to which the natural language input (utterance 208 or natural language text 210) corresponds. Statistical classifier 204 can alternatively output a ranked list (or n-best list) of task or class identifiers 214. Statistical classifier 204 will also be described in greater detail below. The task identifier 214 is provided to an application or other component that can take action based on the identified task. For example, if the identified task is the SendMail task, identifier 214 is sent to the electronic mail application which can, in turn, display an electronic mail template for use by the user. Of course, any other task or class is contemplated as well. Similarly, if an n-best list of identifiers 214 is output, each item in the list can be displayed through a suitable user interface such that a user can select the desired class or task.

[0034] It can thus be seen that system 200 can perform at least the first level of understanding required by a natural language interface, that is, identifying a task represented by the natural language input.

Feature Selection

[0035] A set of features must be selected for extraction from the natural language input. The set of features will illustratively be those found to be most helpful in performing task classification. This can be determined empirically or otherwise.

[0036] In one embodiment, the natural language input text 210 is embodied as a set of words. One group of features will illustratively correspond to the presence or absence of words in the natural language input text 210, wherein only words in a certain vocabulary designed for a specific application are considered, and words outside the vocabulary are mapped to a distinguished word-type such as <UNKNOWN>. Therefore, for example, a place will exist in feature vector 212 for each word in the vocabulary (including the <UNKNOWN> word), and its place will be filled with a value of 1 or 0 depending upon whether the word is present or not in the natural language input text 210, respectively. Thus, the binary feature vector would be a vector having a length corresponding to the number of words in the lexicon (or vocabulary) supported by the natural language interface.
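By way of illustration, the following Python sketch shows one way such a binary feature vector might be constructed; the small vocabulary and the function name are illustrative assumptions, not part of the described system.

    # Sketch: build a binary presence/absence feature vector over a fixed
    # vocabulary; out-of-vocabulary words map to the <UNKNOWN> word-type.
    VOCAB = ["send", "mail", "list", "flights", "from", "to", "<UNKNOWN>"]
    INDEX = {word: i for i, word in enumerate(VOCAB)}

    def feature_vector(text):
        vec = [0] * len(VOCAB)  # one place per vocabulary word
        for word in text.lower().split():
            vec[INDEX.get(word, INDEX["<UNKNOWN>"])] = 1
        return vec

    # "boston" and "seattle" are outside the vocabulary, so the <UNKNOWN>
    # feature is set along with "list", "flights", "from" and "to".
    print(feature_vector("List flights from Boston to Seattle"))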

[0037] Of course, it should be noted that many other features can be selected as well. For example, the co-occurrences of words can be features. This may be used, for instance, in order to more explicitly identify tasks to be performed. For example, the co-occurrence of the words “send mail” may be a feature in the feature vector. If these two words are found, in this order, in the input text, then the corresponding feature in the feature vector is marked to indicate the feature was present in the input text. A wide variety of other features can be selected as well, such as bi-grams, tri-grams, other n-grams, and any other desired features.

[0038] Similarly, preprocessing can optionally be performed on natural language text 210 by preprocessor 211 in order to arrive at feature vector 212. For instance, it may be desirable that the feature vector 212 only indicate the presence or absence of words that have been predetermined to carry semantic content. Therefore, natural language text 210 can be preprocessed to remove stop words and to maintain only content words, prior to the feature selection process. Similarly, preprocessor 211 can include rule-based systems (discussed below) that can be used to tag certain semantic items in natural language text 210. For instance, the natural language text 210 can be preprocessed so that proper names are tagged, as well as the names of cities, dates, etc. The existence of these tags can be indicated as a feature as well. Therefore, they will be reflected in feature vector 212. In another embodiment, the tagged words can be removed and replaced by the tags.

[0039] In addition, stemming can also be used in feature selection. Stemming is a process of removing morphological variations in words to obtain their root forms. Examples of morphological variations include inflectional changes (such as pluralization, verb tense, etc.) and derivational changes that alter a word's grammatical role (such as adjective versus adverb, as in slow versus slowly). Stemming can be used to condense multiple features with the same underlying semantics into single features. This can help overcome data sparseness, improve computational efficiency, and reduce the impact of the feature independence assumptions used in statistical classification methods.
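A minimal sketch of such preprocessing, assuming the NLTK library for stemming and an illustrative stop-word list, might look as follows.

    # Sketch: remove stop words, then stem the remaining content words so
    # that morphological variants collapse into a single feature.
    from nltk.stem import PorterStemmer  # assumes NLTK is installed

    STOP_WORDS = {"a", "an", "the", "to", "from", "please"}  # illustrative
    stemmer = PorterStemmer()

    def preprocess(text):
        content = [w for w in text.lower().split() if w not in STOP_WORDS]
        return [stemmer.stem(w) for w in content]

    # "flights" stems to "flight", so "flight" and "flights" share a feature.
    print(preprocess("List flights from Boston to Seattle"))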

[0040] In any case, feature vector 212 is illustratively a vector which has a size corresponding to the number of features selected. The state of those features in natural language input text 210 can then be identified by the bit locations corresponding to each feature in feature vector 212. While a number of features have been discussed, they are not intended to limit the scope of the present invention, and different or other features can be used as well.

Task or Class Identification (Text Classification)

[0041] Statistical classifiers are very robust with respect to unseen data. In addition, they require little supervision in training. Therefore, one embodiment of the present invention uses statistical classifier 204 to perform task or class identification on the feature vector 212 that corresponds to the natural language input. A wide variety of statistical classifiers can be used as classifier 204, and different combinations can be used as well. The present discussion proceeds with respect to Naive Bayes classifiers, task-dependent n-gram language models, and support vector machines. The present discussion also proceeds with respect to a combination of statistical classifiers, and a combination of statistical classifiers and a rule-based system for task or class identification.

[0042] The following description will proceed assuming that the feature vector is represented by w and has a size V (which is the size of the vocabulary supported by system 200) with binary elements (or features) equal to one if the given word is present in the natural language input and zero otherwise. Of course, where the features include not only the vocabulary or lexicon but also other features (such as those mentioned above with respect to feature selection), the dimension of the feature vector will be different.

[0043] The Naive Bayes classifier receives this input vector and assumes independence among the features. Therefore, given input vector w, its target class can be found by choosing the class with the highest posterior probability:

$$\hat{c} = \arg\max_{c} P(c \mid w) = \arg\max_{c} P(c)\,P(w \mid c) = \arg\max_{c} P(c) \prod_{i=1}^{V} P(w_i = 1 \mid c)^{\delta(w_i,1)}\, P(w_i = 0 \mid c)^{\delta(w_i,0)} \qquad \text{(Eq. 1)}$$

[0044] where $P(c \mid w)$ is the probability of a class given the sentence (represented as the feature vector w);

[0045] $P(c)$ is the probability of a class;

[0046] $P(w \mid c)$ is the conditional probability of the feature vector extracted from a sentence given the class c;

[0047] $P(w_i = 1 \mid c)$ or $P(w_i = 0 \mid c)$ is the conditional probability that word $w_i$ is observed or not observed, respectively, in a sentence that belongs to class c;

[0048] $\delta(w_i,1) = 1$ if $w_i = 1$, and 0 otherwise; and

[0049] $\delta(w_i,0) = 1$ if $w_i = 0$, and 0 otherwise.

[0050] In other words, according to Equation 1, the classifier picks the class c that has the greatest posterior probability $P(c \mid w)$ as the target class for the natural language input. Where more than one target class is to be identified, the top n values of $P(c)P(w \mid c)$ will correspond to the top n classes represented by the natural language input.

[0051] Because sparseness of data may be a problem, $P(w_i = 1 \mid c)$ can be estimated as follows:

$$P(w_i = 1 \mid c) = \frac{N_c^i + b}{N_c + 2b} \qquad \text{(Eq. 2)}$$

$$P(w_i = 0 \mid c) = 1 - P(w_i = 1 \mid c) \qquad \text{(Eq. 3)}$$

[0052] where $N_c$ is the number of natural language inputs for class c in the training data;

[0053] $N_c^i$ is the number of times word i appeared in the natural language inputs for class c in the training data;

[0054] $P(w_i = 1 \mid c)$ is the conditional probability that the word i appears in the natural language textual input given class c;

[0055] $P(w_i = 0 \mid c)$ is the conditional probability that the word i does not appear in the input given class c; and

[0056] b is a value estimated to smooth all probabilities and is tuned to maximize the classification accuracy on cross-validation data in order to accommodate unseen data. Of course, it should be noted that b can be made sensitive to different classes as well, but may illustratively simply be maximized in view of cross-validation data and be the same regardless of class.

[0057] Also, it should again be noted that when using a Naïve Bayes classifier the feature vector can be different than simply all words in the vocabulary. Instead, preprocessing can be run on the natural language input to remove unwanted words, semantic items can be tagged, and bi-grams, tri-grams and other word co-occurrences can be identified and used as features, etc.
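The following Python sketch pulls Equations 1 through 3 together; the training examples, class names, and value of b are illustrative assumptions.

    # Sketch: Naive Bayes task classification per Eqs. 1-3.
    import math

    def train_naive_bayes(examples, vocab, b=0.5):
        # examples: list of (set_of_words, class_label) pairs
        model = {}
        for c in {label for _, label in examples}:
            docs = [words for words, label in examples if label == c]
            n_c = len(docs)                       # N_c
            prior = n_c / len(examples)           # P(c)
            p_word = {w: (sum(w in d for d in docs) + b) / (n_c + 2 * b)
                      for w in vocab}             # P(w_i = 1 | c), Eq. 2
            model[c] = (prior, p_word)
        return model

    def classify(model, vocab, words):
        def log_posterior(c):
            prior, p = model[c]
            # Eq. 1 in log space; Eq. 3 gives P(w_i = 0 | c) = 1 - P(w_i = 1 | c)
            return math.log(prior) + sum(
                math.log(p[w] if w in words else 1.0 - p[w]) for w in vocab)
        return max(model, key=log_posterior)

    examples = [({"send", "mail"}, "SendMail"),
                ({"list", "flights"}, "ShowFlights")]
    vocab = ["send", "mail", "list", "flights"]
    model = train_naive_bayes(examples, vocab)
    print(classify(model, vocab, {"send", "mail", "please"}))  # -> SendMail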

[0058] Another type of classifier which can be used as classifier 204 is a set of class-dependent n-gram statistical language model classifiers. If the words in the natural language input 210 are viewed as values of a random variable instead of binary features, Equation 1 can be decomposed in a different way as follows:

$$\hat{c} = \arg\max_{c} P(c)\,P(w \mid c) = \arg\max_{c} P(c) \prod_{i=1}^{|w|} P(w_i \mid c, w_{i-1}, w_{i-2}, \ldots, w_1) \qquad \text{(Eq. 4)}$$

[0059] where |w| is the length of the text w, and Markov independence assumptions of orders 1, 2 and 3 can be made to use a task-specific uni-gram $P(w_i \mid c)$, bi-gram $P(w_i \mid c, w_{i-1})$ or tri-gram $P(w_i \mid c, w_{i-1}, w_{i-2})$, respectively.

[0060] One class-specific model is generated for each class c. Therefore, when a natural language input 210 is received, the class-specific language models P(w|c) are run on the natural language input 210, for each class. The output from each language model is multiplied by the prior probability for the respective class. The class with the highest resulting value corresponds to the target class.
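A simplified Python sketch of this class-dependent approach uses bigram models with add-alpha smoothing as an illustrative stand-in for the interpolated smoothing discussed below; the training data and class priors are assumptions.

    # Sketch: one bigram language model per class, per Eq. 4; the class
    # score is the prior P(c) times the class-specific model probability.
    import math
    from collections import defaultdict

    def train_bigram_lm(sentences, alpha=0.1):
        counts = defaultdict(lambda: defaultdict(int))
        vocab = set()
        for words in sentences:
            for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
                counts[prev][cur] += 1
                vocab.add(cur)
        v = len(vocab)

        def log_prob(words):
            total = 0.0
            for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
                c = counts[prev]
                total += math.log((c[cur] + alpha) /
                                  (sum(c.values()) + alpha * v))
            return total
        return log_prob

    def classify(models, priors, words):
        # models: {class: scoring fn}; priors: {class: P(c)}
        return max(models, key=lambda c: math.log(priors[c]) + models[c](words))

    models = {"ShowFlights": train_bigram_lm([["list", "flights"]]),
              "SendMail": train_bigram_lm([["send", "mail"]])}
    priors = {"ShowFlights": 0.5, "SendMail": 0.5}
    print(classify(models, priors, ["list", "flights"]))  # -> ShowFlights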

[0061] While this may appear to be highly similar to the Naive Bayes classifier discussed above, it is different. For example, when considering n-grams, word co-occurrences of a higher order than those used by the Naive Bayes classifier are typically considered. For example, tri-grams require looking at word triplets, whereas in the Naive Bayes classifier this is not necessarily the case.

[0062] Similarly, even if only uni-grams are used, the n-gram classifier is still different than the Naive Bayes classifier. In the Naive Bayes classifier, if a word in the vocabulary occurs in the natural language input 210, the feature value for that word is 1, regardless of whether the word occurs in the input multiple times. By contrast, the number of occurrences of the word will be considered in the n-gram classifier.

[0063] In accordance with one embodiment, the class-specific n-gram language models are trained by splitting sentences in a training corpus among the various classes for which n-gram language models are being trained. All of the sentences corresponding to each class are used in training an n-gram classifier for that class. This yields a number c of n-gram language models, where c corresponds to the total number of classes to be considered.

[0064] Also, in one embodiment, smoothing is performed in training the n-gram language models in order to accommodate unseen training data. The n-gram probabilities for the class-specific training models are estimated using linear interpolation of relative frequency estimates at different orders (such as 0 for a uniform model, . . . , n for an n-gram model). The linear interpolation weights at different orders are bucketed according to context counts, and their values are estimated using maximum likelihood techniques on cross-validation data. The n-gram counts from the cross-validation data are then added to the counts gathered from the main training data to enhance the quality of the relative frequency estimates. Such smoothing is set out in greater detail in Jelinek and Mercer, Interpolated Estimation of Markov Source Parameters From Sparse Data, Pattern Recognition in Practice, Gelsema and Kanal editors, North-Holland (1980).
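A sketch of this interpolation idea, with fixed illustrative weights in place of weights estimated on cross-validation data, appears below.

    # Sketch: linear interpolation of relative frequency estimates at
    # orders 0 (uniform), 1 (unigram) and 2 (bigram). In the scheme
    # described above the lambda weights are bucketed by context count
    # and estimated by maximum likelihood on cross-validation data;
    # here they are fixed for illustration.
    def interpolated_bigram_prob(cur, prev, unigram_counts, bigram_counts,
                                 vocab_size, lambdas=(0.1, 0.3, 0.6)):
        l_uniform, l_uni, l_bi = lambdas       # weights sum to 1
        total = sum(unigram_counts.values())
        p_uniform = 1.0 / vocab_size
        p_uni = unigram_counts.get(cur, 0) / total if total else 0.0
        prev_count = unigram_counts.get(prev, 0)
        p_bi = (bigram_counts.get((prev, cur), 0) / prev_count
                if prev_count else 0.0)
        return l_uniform * p_uniform + l_uni * p_uni + l_bi * p_bi

    print(interpolated_bigram_prob(
        "flights", "list", {"list": 3, "flights": 3},
        {("list", "flights"): 3}, vocab_size=10))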

[0065] Support vector machines can also be used as statistical classifier 204. Support vector machines learn discriminatively by finding a hyper-surface in the space of possible input feature vectors. The hyper-surface attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hyper-surface to the nearest of the positive and negative examples. This tends to make the classification correct for test data that is near, but not identical to, the training data. In one embodiment, sequential minimal optimization is used as a fast method to train support vector machines.

[0066] Again, the feature vector can be any of the feature vectors described above, such as a bit vector of length equal to the vocabulary size where the corresponding bit in the vector is set to one if the word appears in the natural language input, and other bits are set to 0. Of course, the other features can be selected as well and preprocessing can be performed on the natural language input prior to feature vector extraction, as also discussed above. Also, the same techniques discussed above with respect to cross-validation data can be used during training to accommodate data sparseness.
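A minimal sketch using the scikit-learn library (an assumption; any SVM implementation would do) shows the idea on binary word-presence vectors; the tiny training set is illustrative.

    # Sketch: linear SVM over binary word-presence feature vectors.
    # scikit-learn's SVC is trained with an SMO-style solver, in the
    # spirit of the sequential minimal optimization method cited below.
    from sklearn.svm import SVC

    # Columns: send, mail, list, flights (presence = 1, absence = 0)
    X = [[1, 1, 0, 0],
         [0, 0, 1, 1],
         [1, 0, 0, 0],
         [0, 0, 1, 0]]
    y = ["SendMail", "ShowFlights", "SendMail", "ShowFlights"]

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([[1, 1, 0, 0]]))  # -> ['SendMail']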

[0067] The particular support vector machine techniques used are generally known and do not form part of the present invention. One exemplary support vector machine is described in Burges, C. J. C., A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998, 2(2), pp. 121-167. One technique for performing training of the support vector machines as discussed herein is set out in Platt, J. C., Fast Training of Support Vector Machines Using Sequential Minimal Optimization, Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, 1999, pp. 185-208.

[0068] Another embodiment of statistical classifier 204 is shown in FIG. 3. In the embodiment shown in FIG. 3, statistical classifier component 204 includes a plurality of individual statistical classifiers 216, 218 and 220 and a selector 221, which is comprised of a voting component 222 in FIG. 3. The statistical classifiers 216-220 are different from one another and can be the different classifiers discussed above, or others. Each of these statistical classifiers 216-220 receives feature vector 212. Each classifier also picks a target class (or a group of target classes) which that classifier believes is represented by feature vector 212. Classifiers 216-220 provide their outputs to class selector 221. In the embodiment shown in FIG. 3, selector 221 is a voting component 222 which simply uses a known majority voting technique to output, as the task or class ID 214, the ID associated with the task or class most often chosen by statistical classifiers 216-220 as the target class. Other voting techniques can be used as well. For example, when the classifiers 216-220 do not agree with one another, it may be sufficient to choose the output of a most accurate one of the classifiers being used, such as the support vector machine. In this way, the results from the different classifiers 216-220 can be combined for better classification accuracy.
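One possible voting scheme, sketched in Python with illustrative class IDs, is a simple majority vote with a fall-back to a designated most-accurate classifier when no majority exists.

    # Sketch: majority voting over the class IDs output by the
    # individual classifiers 216-220; ties fall back to the classifier
    # assumed most accurate (e.g., the support vector machine).
    from collections import Counter

    def vote(predictions, fallback_index=0):
        winner, count = Counter(predictions).most_common(1)[0]
        if count > len(predictions) // 2:
            return winner          # a strict majority chose this class
        return predictions[fallback_index]

    print(vote(["SendMail", "SendMail", "ShowFlights"]))  # -> SendMail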

[0069] In addition, each of classifiers 216-220 can output a ranked list of target classes (an n-best list). In that case, selector 221 can use the n-best list from each classifier in selecting a target class or its own n-best list of target classes.

[0070] FIG. 4 shows yet another embodiment of statistical classifier 204 shown in FIG. 2. In the embodiment shown in FIG. 4, a number of the items are similar to those shown in FIG. 3, and are similarly numbered. However, selector 221, which was a voting component 222 in the embodiment shown in FIG. 3, is an additional statistical classifier 224 in the embodiment shown in FIG. 4. Statistical classifier 224 is trained to take, as its input feature vector, the outputs from the other statistical classifiers 216-220. Based on this input feature vector, classifier 224 outputs the task or class ID 214. This further improves the accuracy of classification.
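As a sketch of this cascaded arrangement, a second-stage classifier can be trained on the first-stage outputs; logistic regression and the integer class encoding below are illustrative assumptions, as the text does not prescribe a particular model for classifier 224.

    # Sketch (assumes scikit-learn): train selector 224 on the class IDs
    # output by the first-stage classifiers. 0 = SendMail, 1 = ShowFlights.
    from sklearn.linear_model import LogisticRegression

    # Each row: class IDs from three first-stage classifiers;
    # y: the correct class from supervised training data.
    X = [[0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 1, 0]]
    y = [0, 1, 0, 1]

    selector = LogisticRegression().fit(X, y)
    print(selector.predict([[0, 0, 1]]))  # -> [0], i.e., SendMail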

[0071] It should also be noted, of course, that the selector 221 which ultimately selects the task or class ID could be other components as well, such as a neural network or a component other than the voting component 222 shown in FIG. 3 and the statistical classifier 224 shown in FIG. 4.

[0072] In order to train the class or task selector 221, training data is processed. The selector takes as an input feature vector the outputs from the statistical classifiers 216-220, along with the correct class for the supervised training data. In this way, the selector 221 is trained to generate a correct task or class ID based on the input feature vector.

[0073] In another embodiment, each of the statistical classifiers 216-220 not only outputs a target class or a set of classes, but also a corresponding confidence measure or confidence score which indicates the confidence that the particular classifier has in its selected target class or classes. Selector 221 can receive the confidence measure both during training and during run time, in order to improve the accuracy with which it identifies the task or class corresponding to feature vector 212.

[0074] FIG. 5 illustrates yet another embodiment of classifier 204. A number of the items shown in FIG. 5 are similar to those shown in FIGS. 3 and 4, and are similarly numbered. However, FIG. 5 shows that classifier 204 can include non-statistical components, such as non-statistical rule-based analyzer 230. Analyzer 230 can be, for example, a grammar-based robust parser. Grammar-based robust parsers are expressive and portable, can model the language at various levels of granularity, and are relatively easy to modify in order to adapt to new language usages. While they can require manual grammar development or more supervision in automatic training for grammar acquisition, and while they may be less robust in terms of unseen data, they can be useful to selector 221 in selecting the accurate task or class ID 214.

[0075] Therefore, rule-based analyzer 230 takes, as an input, natural language text 210 and provides, as its output, a class ID (and optionally, a confidence measure) corresponding to the target class. Such a classifier can be a simple trigger-class mapping heuristic (where trigger words or morphs in the input 210 are mapped to a class), or a parser with a semantic understanding grammar.
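A trigger-class mapping of this kind can be sketched in a few lines of Python; the trigger words and class names are illustrative.

    # Sketch: map trigger words in the input text directly to a class.
    TRIGGERS = {"mail": "SendMail", "email": "SendMail",
                "flight": "ShowFlights", "flights": "ShowFlights"}

    def rule_based_class(text):
        for word in text.lower().split():
            if word in TRIGGERS:
                return TRIGGERS[word]
        return None  # no trigger fired; defer to the statistical classifiers

    print(rule_based_class("list flights from Boston to Seattle"))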

Class Identification and Semantic Interpretation

[0076] Task classification may, in some instances, be insufficient to completely perform a task in applications that need more detailed information. A statistical classifier, or combination of multiple classifiers as discussed above, can only identify the top-level semantic information (such as the class or task) of a sentence. For example, such a system may identify the task corresponding to the natural language input sentence “List flights from Boston to Seattle” as the task “ShowFlights”. However, the system cannot identify the detailed semantic information (i.e., the slots) about the task from the user's utterance, such as the departure city (Boston) and the destination city (Seattle).

[0077] The example below shows the semantic representation for this sentence:

    <ShowFlight text=“list flights from Boston to Seattle”>
      <Flight>
        <City text=“Boston” name=“Depart”/>
        <City text=“Seattle” name=“Arrive”/>
      </Flight>
    </ShowFlight>

[0078] In this example, the name of the top-level frame (i.e., the class or task) is “ShowFlight”. The paths from the root to the leaf, such as <ShowFlight> <Flight> <City text=“Boston” name=“Depart”/>, are slots in the semantic representation. The statistical classifiers discussed above are simply unable to fill the slots identified in the task or class.

[0079] Such high resolution understanding has conventionally been attempted with a semantic parser that uses a semantic grammar in an attempt to match the input sentences against a grammar that models both tasks and slots. However, in such a conventional system, the semantic parser is simply not robust enough, because there are often unexpected instances of commands that are not covered by the grammar.

[0080] Therefore, FIG. 6 illustrates a block diagram of a portion of a natural language interface system 300 which takes advantage of both the robustness of statistical classifiers and the high resolution capability of semantic parsers. System 300 includes a number of components which are similar to those shown in previous figures, and are similarly numbered. However, system 300 also includes robust parser 302 which outputs a semantic interpretation 303. Robust parser 302 can be any of those described in Ward, W., Recent Improvements in the CMU Spoken Language Understanding System, Human Language Technology Workshop 1994, Plainsboro, N.J.; Wang, Robust Spoken Language Understanding in MiPad, Eurospeech 2001, Aalborg, Denmark; Wang, Robust Parser for Spoken Language Understanding, Eurospeech 1999, Budapest, Hungary; Wang, Acero, Evaluation of Spoken Language Grammar Learning in ATIS Domain, ICASSP 2002, Orlando, Fla.; or Wang, Acero, Grammar Learning for Spoken Language Understanding, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, Madonna di Campiglio, Italy.

[0081] FIG. 7 is a flow diagram that illustrates the operation of system 300 shown in FIG. 6. Blocks 208-214 shown in FIG. 6 operate in the same fashion as described above with respect to FIGS. 2-5. In other words, where the input received is a speech or voice input, the utterance is received as indicated by block 304 in FIG. 7 and speech recognition engine 206 performs speech recognition on the input utterance, as indicated by block 306. Then, input text 210 can optionally be preprocessed by preprocessor 211, as indicated by block 307 in FIG. 7, and is provided to feature extraction component 202 which extracts feature vector 212 from input text 210. Feature vector 212 is provided to statistical classifier 204 which identifies the task or class represented by the input text. This is indicated by block 308 in FIG. 7.

[0082] The task or class ID 214 is then provided, along with the natural language input text 210, to robust parser 302. Robust parser 302 dynamically modifies the grammar such that the parsing component in robust parser 302 only applies grammatical rules that are related to the identified task or class represented by ID 214. Activation of these rules in the rule-based analyzer 302 is indicated by block 310 in FIG. 7.

[0083] Robust parser 302 then applies the activated rules to the natural language input text 210 to identify semantic components in the input text. This is indicated by block 312 in FIG. 7.

[0084] Based upon the semantic components identified, parser 302 fills slots in the identified class to obtain a semantic interpretation 303 of the natural language input text 210. This is indicated by block 314 in FIG. 7.
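The two-pass idea can be sketched as follows; the regular-expression "rules" and slot names are toy stand-ins for a real semantic grammar, used only to show how activating class-specific rules fills the slots.

    # Sketch: activate only the rules associated with the identified
    # class (ID 214), then apply them to fill slots in that class.
    import re

    RULES = {
        "ShowFlights": [("Depart", re.compile(r"from (\w+)")),
                        ("Arrive", re.compile(r"to (\w+)"))],
        "SendMail": [("Recipient", re.compile(r"to ([A-Z]\w+(?: [A-Z]\w+)*)"))],
    }

    def fill_slots(class_id, text):
        slots = {}
        for slot, pattern in RULES.get(class_id, []):  # class-specific rules only
            match = pattern.search(text)
            if match:
                slots[slot] = match.group(1)
        return {"class": class_id, "slots": slots}

    print(fill_slots("ShowFlights", "list flights from Boston to Seattle"))
    # -> {'class': 'ShowFlights', 'slots': {'Depart': 'Boston', 'Arrive': 'Seattle'}}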

[0085] Thus, system 300 not only increases the accuracy of the semantic parser, because task ID 214 allows parser 302 to work more accurately on sentences with structure that was not seen in the training data, but it also speeds up parser 302, because the search is directed to a subspace of the grammar since only those rules pertaining to task or class ID 214 are activated.

[0086] It can thus be seen that different aspects of the present invention can be used to obtain improvements in both phases of processing natural language in natural language interfaces: identifying a task represented by the natural language input (text classification) and filling semantic slots in the identified task. The task can be identified using a statistical classifier, multiple statistical classifiers, or a combination of statistical classifiers and rule-based classifiers. The semantic slots can be filled by a robust parser by first identifying the class or task represented by the input and then activating only rules in the grammar used by the parser that relate to that particular class or task.

[0087] Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

What is claimed is:
1. A text classifier in a natural language interface that receives a natural language user input, the text classifier comprising: a feature extractor extracting a feature vector from a textual input indicative of the natural language user input; and a statistical classifier coupled to the feature extractor outputting a class identifier identifying a target class associated with the textual input based on the feature vector.
2. The text classifier of claim 1 wherein the statistical classifier comprises: a plurality of statistical classification components each outputting a class identifier.
3. The text classifier of claim 2 wherein the statistical classifier comprises: a class selector coupled to the plurality of statistical classification components and selecting one of the class identifiers as identifying the target class.
4. The text classifier of claim 3 wherein the class selector comprises a voting component.
5. The text classifier of claim 3 wherein the class selector comprises an additional statistical classifier.
6. The text classifier of claim 1 and further comprising: a rule-based classifier receiving the textual input and outputting a class identifier; and a selector selecting at least one of the class identifiers as identifying the target class.
7. The text classifier of claim 1 and further comprising: a rule-based parser receiving the textual input and the class identifier and outputting a semantic representation of the textual input.
8. The text classifier of claim 7 wherein the semantic representation includes a class having slots, the slots being filled with semantic expressions.
9. The text classifier of claim 1 and further comprising: a pre-processor identifying words in the textual input having semantic content.
10. The text classifier of claim 9 wherein the preprocessor is configured to remove words from the textual input that have insufficient semantic content.
11. The text classifier of claim 9 wherein the preprocessor is configured to insert tags for words in the textual input, the tags being semantic labels for the words.
12. The text classifier of claim 1 wherein the feature vector is based on words in a vocabulary supported by the natural language interface.
13. The text classifier of claim 12 wherein the feature vector is based on n-grams of the words in the vocabulary.
14. The text classifier of claim 12 wherein the feature vector is based on words in the vocabulary having semantic content.
15. The text classifier of claim 1 wherein the statistical classifier comprises a Naive Bayes classifier.
16. The text classifier of claim 1 wherein the statistical classifier comprises a support vector machine.
17. The text classifier of claim 1 wherein the statistical classifier comprises a plurality of class-specific statistical language models.
18. The text classifier of claim 1 wherein a number c of classes are supported by the natural language interface and wherein the statistical classifier comprises c class-specific statistical language models.
19. The text classifier of claim 1 and further comprising: a speech recognizer receiving a speech signal indicative of the natural language input and providing the textual input.
20. The text classifier of claim 1 wherein the statistical classifier identifies a plurality of n-best target classes.
21. The text classifier of claim 20 and further comprising: an output displaying the n-best target classes for user selection.
22. The text classifier of claim 2 wherein each statistical classifier outputs a plurality of n-best target classes.
23. A computer-implemented method of processing a natural language input for use in completing a task represented by the natural language input, comprising: performing statistical classification on the natural language input to obtain a class identifier for a target class associated with the natural language input; identifying rules in a rule-based analyzer based on the class identifier; and analyzing the natural language input with the rule-based analyzer using the identified rules to fill semantic slots in the target class.
24. The method of claim 23 and further comprising: prior to performing statistical classification, identifying words in the natural language input that have semantic content.
25. The method of claim 23 wherein the natural language input is represented by a speech signal and further comprising: performing speech recognition on the speech signal prior to performing statistical classification.
26. The method of claim 23 wherein performing statistical classification comprises: performing statistical classification on the natural language input using a plurality of different statistical classifiers; and selecting a class identifier output by one of the statistical classifiers as representing the target class.
27. The method of claim 26 wherein selecting comprises: performing statistical classification on the class identifiers output by the plurality of statistical classifiers to select the class identifier that represents the target class.
28. The method of claim 26 wherein selecting comprises: selecting the class identifier output by a greatest number of the plurality of statistical classifiers.
29. The method of claim 23 and further comprising: performing rule-based analysis on the natural language input to obtain a class identifier; and identifying the target class based on the class identifier obtained from the statistical classification and the class identifier obtained from the rule-based analysis.
30. A system for identifying a task to be performed by a computer based on a natural language input, comprising: a feature extractor extracting features from the natural language input; and a statistical classifier, trained to accommodate unseen data, receiving the extracted features and identifying the task based on the features.
31. The system of claim 30 wherein probabilities used by the statistical classifier are smoothed using smoothing data to accommodate for the unseen data.
32. The system of claim 31 wherein the smoothing data is obtained using cross-validation data.
33. A text classifier identifying a target class corresponding to a natural language input, comprising: a feature extractor extracting a set of features from the natural language input; and a Naïve Bayes classifier receiving the set of features and identifying the target class based on the set of features.
34. The text classifier of claim 33 wherein the target class is indicative of a task to be performed based on the natural language input.
35. The text classifier of claim 34 and further comprising: a preprocessor identifying content words in the natural language input prior to the feature extractor extracting the set of features.
36. The text classifier of claim 35 wherein the preprocessor identifies the content words by removing from the natural language input words having insufficient semantic content.
37. A text classifier identifying a target class corresponding to a natural language input, comprising: a feature extractor extracting a set of features from the natural language input; and a statistical language model classifier receiving the set of features and identifying the target class based on the set of features.
38. The text classifier of claim 37 wherein the set of features includes n-grams.
39. The text classifier of claim 37 and further comprising: a preprocessor identifying content words in the natural language input prior to the feature extractor extracting the set of features.
40. A text classifier identifying one or more target classes corresponding to a natural language input, comprising: a feature extractor extracting a set of features from the natural language input; and a plurality of statistical classifiers receiving the set of features and identifying a target class based on the set of features.
41. The text classifier of claim 40 wherein each statistical classifier outputs a class identifier based on the set of features and further comprising: a selector receiving the class identifiers from each of the statistical classifiers and selecting the target class as a class identified by at least one of the class identifiers.
42. The text classifier of claim 40 and further comprising: a preprocessor identifying content words in the natural language input prior to the feature extractor extracting the set of features.
43. A text classifier identifying a target class corresponding to a natural language input, comprising: a feature extractor extracting a set of features from the natural language input; a statistical classifier receiving the set of features and outputting a class identifier based on the set of features; a rule-based classifier outputting a class identifier based on the natural language input; and a selector selecting a target class based on the class identifiers output by the statistical classifier and the rule-based classifier.
44. The text classifier of claim 43 and further comprising: a preprocessor identifying content words in the natural language input prior to the feature extractor extracting the set of features and prior to the rule-based classifier receiving the natural language input.
45. A text classifier identifying a target task to be completed corresponding to a natural language input, comprising: a feature extractor extracting a set of features from a textual input indicative of the natural language input; a statistical classifier receiving the set of features and identifying the target task based on the set of features; and a rule-based parser receiving the textual input and a class identifier indicative of the identified target task and outputting a semantic representation of the textual input.
46. The text classifier of claim 45 wherein the rule-based parser is configured to identify semantic expressions in the textual input.
47. The text classifier of claim 46 wherein the semantic representation includes a class having slots, the slots being filled with the semantic expressions.
48. The text classifier of claim 45 and further comprising: a pre-processor identifying words in the textual input having semantic content.
49. The text classifier of claim 48 wherein the preprocessor is configured to remove words from the textual input that have insufficient semantic content.
50. The text classifier of claim 48 wherein the preprocessor is configured to insert tags for words in the textual input, the tags being semantic labels for the words.
51. The text classifier of claim 48 wherein the preprocessor is configured to replace words in the textual input with semantic tags, the semantic tags being semantic labels for the words.